AITopics | small model

small model

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

Kwon, Soo Min, Sun, Ziteng, Suresh, Ananda Theertha, Jain, Himanshu, Kumar, Sanjiv

arXiv.org Machine Learning, May-12-2026

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.
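The sparse-reward problem mentioned in the abstract can be illustrated with the group-relative advantage commonly used in GRPO, where each rollout's reward is normalized against the other rollouts for the same prompt. The following is a minimal sketch under stated assumptions, not the CoDistill-GRPO objective itself (the abstract does not give it): the function name, the `eps` constant, and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group (hypothetical helper).

    rewards: one scalar reward per rollout sampled for the same prompt.
    eps:     assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct rollout out of four: the correct one gets a positive advantage.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

# A hard prompt where every rollout fails: all advantages are zero,
# so the policy gradient carries no learning signal for this group.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why a small model that rarely solves a difficult task receives little signal from outcome rewards alone, which is the gap a denser reward such as a KD term is meant to fill.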

large language model, machine learning, natural language, (16 more...)

2605.08873

Country: North America > United States (0.46)

Genre: Research Report (0.42)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models

Neural Information Processing Systems, Mar-17-2026, 22:06:03 GMT

Large language models are usually fine-tuned to align with human preferences. However, fine-tuning a large language model can be challenging. In this work, we introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search to maximize the log-probability difference between small tuned and untuned models while sampling from the frozen large model. This method serves both as (1) a compute-efficient model up-scaling strategy that avoids directly tuning the large model and as (2) an instance of weak-to-strong generalization that enhances a strong model with weak test-time guidance. Empirically, we demonstrate the flexibility of weak-to-strong search across different tasks. In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
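The selection criterion described in the abstract can be sketched as a single greedy step: given candidate continuations sampled from the frozen large model, keep the one with the largest log-probability difference between the small tuned and small untuned models. This is a minimal sketch, not the paper's full search procedure; `weak_to_strong_step`, `logp_tuned`, and `logp_untuned` are hypothetical names, and the log-probabilities below are made-up toy values.

```python
def weak_to_strong_step(candidates, logp_tuned, logp_untuned):
    """Pick the candidate that maximizes log p_tuned(y) - log p_untuned(y).

    candidates:   continuations assumed to be sampled from the frozen large model.
    logp_tuned:   callable returning the small tuned model's log-probability.
    logp_untuned: callable returning the small untuned model's log-probability.
    """
    return max(candidates, key=lambda y: logp_tuned(y) - logp_untuned(y))


# Toy log-probabilities standing in for the two small models.
tuned = {"great movie": -2.0, "bad movie": -5.0}
untuned = {"great movie": -4.0, "bad movie": -3.0}

best = weak_to_strong_step(list(tuned), tuned.get, untuned.get)
print(best)  # the candidate the tuned model prefers relative to the untuned one
```

In this sketch the large model is only ever sampled from, never updated, which matches the abstract's point that the large model stays frozen.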

large language model, machine learning, natural language, (10 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.44)

On Optimal Caching and Model Multiplexing for Large Model Inference

Neural Information Processing Systems, Feb-16-2026, 18:25:18 GMT

By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings.
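For readers unfamiliar with GDSF, the classic formulation assigns each cached item the priority clock + frequency × cost / size, evicts the lowest-priority item, and raises the clock to the evicted item's priority. The following is a minimal sketch of that classic policy, assuming per-item `size` and `cost` values are supplied by the caller; it is not necessarily the variant analyzed in the paper, it omits the model multiplexer, and the class and field names are hypothetical.

```python
class GDSFCache:
    """Toy Greedy Dual Size with Frequency cache (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity  # total size budget
        self.used = 0             # size currently occupied
        self.clock = 0.0          # inflation value, raised on each eviction
        self.entries = {}         # key -> dict(value, size, cost, freq, priority)

    def _priority(self, entry):
        return self.clock + entry["freq"] * entry["cost"] / entry["size"]

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None  # cache miss
        entry["freq"] += 1
        entry["priority"] = self._priority(entry)
        return entry["value"]

    def put(self, key, value, size, cost):
        if size > self.capacity:
            return False  # item can never fit
        if key in self.entries:
            self.used -= self.entries.pop(key)["size"]
        while self.used + size > self.capacity:
            # Evict the lowest-priority item and advance the clock to its priority.
            victim = min(self.entries, key=lambda k: self.entries[k]["priority"])
            self.clock = self.entries[victim]["priority"]
            self.used -= self.entries.pop(victim)["size"]
        entry = {"value": value, "size": size, "cost": cost, "freq": 1}
        entry["priority"] = self._priority(entry)
        self.entries[key] = entry
        self.used += size
        return True


cache = GDSFCache(capacity=2)
cache.put("a", "resp-a", size=1, cost=1.0)  # cheap to recompute
cache.put("b", "resp-b", size=1, cost=5.0)  # expensive to recompute
cache.put("c", "resp-c", size=1, cost=2.0)  # forces one eviction
print(sorted(cache.entries))                # the cheap item "a" was evicted
```

In an inference setting, `cost` would plausibly be the expense of regenerating a response with the model, so expensive responses tend to stay cached longer than cheap ones of the same size.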

large language model, machine learning, natural language, (20 more...)

Country:

Asia > Middle East > Jordan (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.93)
(2 more...)

7b97adeafa1c51cf65263459ca9d0d7c-Paper-Conference.pdf

Neural Information Processing Systems, Feb-15-2026, 09:44:21 GMT

large language model, machine learning, small model, (19 more...)

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Maryland > Baltimore (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

SpecTr: Fast Speculative Decoding via Optimal Transport

Neural Information Processing Systems, Feb-12-2026, 14:06:02 GMT

However, autoregressive sampling generates tokens one at a time, which makes it slow and even prohibitive for certain tasks.
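Speculative decoding, named in the title, addresses this by letting a small draft model propose tokens that the large model then verifies. The following is a minimal sketch of the standard single-draft acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q)); it is not SpecTr's optimal-transport scheme, and the function name and toy distributions are hypothetical.

```python
import random


def speculative_accept(token, p, q, rng=random):
    """Verify one drafted token against the target distribution.

    token: a token assumed to be sampled from the draft distribution q.
    p:     target (large model) probabilities, dict token -> prob.
    q:     draft (small model) probabilities, dict token -> prob.
    """
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return token  # draft token accepted as-is
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


p = {"a": 1.0, "b": 0.0}  # toy target model: always "a"
q = {"a": 0.5, "b": 0.5}  # toy draft model: uniform
print(speculative_accept("a", p, q))  # accepted, since p >= q for "a"
print(speculative_accept("b", p, q))  # rejected, then resampled from the residual
```

With this rule the returned token is distributed according to the target model, so several drafted tokens can be checked in a single pass of the large model without changing its output distribution.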

large language model, machine learning, natural language, (18 more...)

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

34ec1286b2ccd4794c5ca4ad078b7150-Paper-Conference.pdf

Neural Information Processing Systems, Feb-11-2026, 00:47:48 GMT

knowledge, language model, small model, (14 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Colombia > Meta Department > Villavicencio (0.04)
North America > Canada > Ontario > Toronto (0.04)
(7 more...)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.67)

Industry:

Education (0.67)
Banking & Finance (0.67)
Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Knowledge Management > Knowledge Engineering (0.93)

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Neural Information Processing Systems, Feb-8-2026, 06:37:26 GMT

LLMs are computationally expensive to pre-train due to their large scale.

acc, large language model, machine learning, (16 more...)

Country:

Asia > China > Hong Kong (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > New Finding (0.93)
Research Report > Experimental Study (0.93)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

LLMs contain a LOT of parameters. But what's a parameter?

MIT Technology Review, Jan-7-2026, 11:23:47 GMT

LLMs contain a LOT of parameters. They're the mysterious numbers that make your favorite AI models tick. What are they and what do they do? I am writing this because one of my editors woke up in the middle of the night and scribbled on a bedside notepad: "What is a parameter?" Unlike a lot of thoughts that hit at 4 a.m., it's a really good question--one that goes right to the heart of how large language models work. A large language model's parameters are often said to be the dials and levers that control how it behaves.

dimension, llm, neuron, (14 more...)

Country:

North America > United States > Massachusetts (0.04)
Asia > China (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Speculative Decoding with Big Little Decoder

Neural Information Processing Systems, Dec-26-2025, 05:06:17 GMT

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text.

decoder, name change, speculative decoding, (9 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.58)

On Giant's Shoulders: Effortless Weak to Strong by Dynamic Logits Fusion

Neural Information Processing Systems, Dec-24-2025, 22:07:32 GMT

Efficient fine-tuning of large language models for task-specific applications is imperative, yet the vast number of parameters in these models makes their training increasingly challenging. Despite numerous proposals for effective methods, a substantial memory overhead remains for gradient computations during updates.

artificial intelligence, name change, natural language, (7 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.59)
